Abstract

This paper proposes a new probabilistic classification algorithm using a Markov
random field approach. The joint distribution of class labels is explicitly modelled
using the distances between feature vectors. Intuitively, a class label should
depend more on class labels which are closer in the feature space than on those
which are further away. Our approach builds on previous work by Holmes and
Adams (2002, 2003) and Cucala et al. (2009). Our work shares many of the
advantages of these approaches in providing a probabilistic basis for the statistical
inference. In comparison to previous work, we present a more efficient
computational algorithm to overcome the intractability of the Markov random
field model. The results of our algorithm are encouraging in comparison to the
k-nearest neighbour algorithm.



1 Introduction 

This paper is concerned with the problem of supervised classification, a topic of interest
in both statistics and machine learning. Hastie et al. (2001) give a description of
various classification methods. We outline our problem as follows. We have a collection
of training data $\{(x_i, y_i), i = 1, \dots, n\}$. The values in the collection
$x = \{x_1, \dots, x_n\}$ are often called features and can be conveniently thought of as
covariates. We denote the class labels as $y = \{y_1, \dots, y_n\}$, where each $y_i$ takes
one of the values $1, 2, \dots, G$. Given a collection of incomplete/unlabelled test data
$\{(x_i, y_i), i = n+1, \dots, n+m\}$, the problem amounts to predicting the class labels
$y^* = \{y_{n+1}, \dots, y_{n+m}\}$ with corresponding feature vectors
$x^* = \{x_{n+1}, \dots, x_{n+m}\}$.

Perhaps the most common approach to classification is the well-known k-nearest
neighbours (k-nn) algorithm. This algorithm amounts to classifying an unlabelled
$y_{n+i}$ as the most common class among the k nearest neighbours of $x_{n+i}$ in the training
set $\{(x_i, y_i), i = 1, \dots, n\}$. While this algorithm is easy to implement, and often gives
good performance, it can be criticised since it does not allow any uncertainty to be
associated with the test class labels or with the value of k. Indeed the choice of k is
crucial to the performance of the algorithm. The value of k is often chosen on the
basis of leave-one-out cross-validation.
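
To make the baseline concrete, the following is a minimal Python sketch of the k-nn rule and of choosing k by leave-one-out cross-validation. It is an illustration only: the function names, the use of Euclidean distance and the rule for breaking voting ties are assumptions made here, not details taken from the paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(x_new, xs, ys, k):
    """Majority vote among the k training points nearest to x_new."""
    order = sorted(range(len(xs)), key=lambda i: euclidean(x_new, xs[i]))
    votes = Counter(ys[i] for i in order[:k])
    top = max(votes.values())
    # Voting ties go to the smallest class label (an arbitrary, assumed choice).
    return min(label for label, count in votes.items() if count == top)


def loo_error(xs, ys, k):
    """Leave-one-out misclassification rate of k-nn on the training set."""
    errors = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        errors += knn_predict(xs[i], rest_x, rest_y, k) != ys[i]
    return errors / len(xs)


def choose_k(xs, ys, candidates):
    """Smallest k among the candidates that minimises the leave-one-out error."""
    return min(candidates, key=lambda k: (loo_error(xs, ys, k), k))
```

Note that the output is a single label and a single value of k, with no measure of uncertainty attached to either, which is the criticism raised above.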

There has been some interest in extending the k-nearest neighbours algorithm to
allow for uncertainty in the test class labelling, most notably by Holmes and Adams
(2002, 2003) and more recently Cucala et al. (2009). Each of these probabilistic
variants of the k-nearest neighbour algorithm is based on defining a neighbourhood
of each point $x_i$, consisting of the k nearest neighbours of $x_i$. Moreover, each of
these neighbouring points has equal influence in determining the missing class label
for $x_i$, regardless of its distance from $x_i$. In this article we present
a class of models, the distance nearest neighbour model, which shares many of the
advantages of these probabilistic approaches, but in contrast to these approaches, the
relative influence of neighbouring points depends on the distance from $x_i$. Formally,
the distance nearest neighbour model is a discrete-valued Markov random field, and,
as is typical with such models, depends on an intractable normalising constant. To
overcome this problem we use the exchange algorithm of Murray et al. (2006) and
illustrate that this provides a computationally efficient algorithm with very good
mixing properties. This contrasts with the difficulties encountered by Cucala et al. (2009)
in their implementation of the sampling scheme of Møller et al. (2006).

This article is organised as follows. Section 2 presents an overview of recent
probabilistic approaches to supervised classification. Section 3 introduces the
new distance nearest neighbour model and outlines how it compares and contrasts to 
previous probabilistic nearest neighbour approaches. We provide a computationally 
efficient framework for carrying out inference for the distance nearest neighbour model 
in Section 4. The performance of the algorithm is illustrated in Section 5 for a variety 
of benchmark datasets, as well as challenging high-dimensional datasets. Finally, we 
present some closing remarks in Section 6. 



2 Probabilistic nearest neighbour models 



Holmes and Adams (2003) attempted to place the k-nn algorithm in a probabilistic
setting, thereby allowing for uncertainty in the test class labelling. In their approach
the full-conditional distribution for a training label is written as

$$\pi(y_i \mid y_{-i}, x, \beta, k) \propto \exp\Big( \beta \sum_{j \sim_k i} I(y_i = y_j) / k \Big),$$

where the summation is over the k nearest neighbours of $x_i$ and where $I(y_i = y_j)$ is
an indicator function taking the value 1 if $y_i = y_j$ and 0 otherwise. The notation
$j \sim_k i$ means that $x_j$ is one of the k nearest neighbours of $x_i$. However, as pointed
out in Cucala et al. (2009), there is a difficulty with this formulation, namely that
there will almost never be a joint probability for y corresponding to this collection
of full-conditionals. The reason is simply that the k-nn neighbourhood system is
usually asymmetric. If $x_i$ is one of the k nearest neighbours of $x_j$, then it does not
necessarily follow that $x_j$ is one of the k nearest neighbours of $x_i$.
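
The asymmetry is easy to exhibit numerically. The sketch below uses three hypothetical one-dimensional feature values and k = 1; the helper `k_nearest` is an illustrative name introduced here.

```python
def k_nearest(i, xs, k):
    """Indices of the k nearest neighbours of point i (one-dimensional features)."""
    others = [j for j in range(len(xs)) if j != i]
    return sorted(others, key=lambda j: abs(xs[i] - xs[j]))[:k]


xs = [0.0, 1.0, 1.5]           # hypothetical feature values
nn_of_0 = k_nearest(0, xs, 1)  # point 1 is the nearest neighbour of point 0
nn_of_1 = k_nearest(1, xs, 1)  # but the nearest neighbour of point 1 is point 2
```

Here point 1 belongs to the neighbourhood of point 0, while point 0 does not belong to the neighbourhood of point 1, so the neighbourhood relation is not symmetric.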



Cucala et al. (2009) corrected the issue surrounding the asymmetry of the k-nn
neighbourhood system. In their probabilistic k-nn (pk-nn) model, the full-conditional
for class label $y_i$ appears as

$$\pi(y_i \mid y_{-i}, x, \beta, k) \propto \exp\Big( \beta/k \Big( \sum_{j \sim_k i} I(y_i = y_j) + \sum_{i \sim_k j} I(y_i = y_j) \Big) \Big), \qquad (1)$$

where the second sum runs over those points $x_j$ that have $x_i$ among their k nearest
neighbours, and this gives rise to the joint distribution

$$\pi(y \mid x, \beta, k) \propto \exp\Big( \beta/k \sum_{i} \sum_{j \sim_k i} I(y_i = y_j) \Big).$$

Therefore under this model, following (1), mutual neighbours are given double weight
with respect to non-mutual neighbours, and for this reason the model could be seen,
perhaps, as an ad-hoc solution to this problem.



It is important to also note that both Holmes and Adams (2002) and Cucala et
al. (2009) allow the value of k to be a variable. Therefore the neighbourhood size can
vary. Holmes and Adams (2002) argue that allowing k to vary has a certain type of
smoothing effect.



3 Distance nearest neighbours 



Motivated by the work of Holmes and Adams (2002) and Cucala et al. (2009), we model
the distribution of the training data as a Markov random field. In contrast to these
approaches, our model depends explicitly on the distances between
points in the training set. Specifically, we define the full-conditional distribution of
the class label $y_i$ as

$$\pi(y_i \mid y_{-i}, x, \beta, \sigma) \propto \exp\Big( \beta \sum_{j \neq i} w_j^i I(y_j = y_i) \Big).$$

Positive values of the Markov random field parameter $\beta$ encourage aggregation of the
class labels. When $\beta = 0$, the class labels are uncorrelated. In contrast to the pk-nn
model, here the neighbourhood set of $x_i$ is constructed to be

$$\{x_j : j = 1, \dots, i-1, i+1, \dots, n\}$$

and is therefore of maximal size. We consider three possible models depending on how
the collection of weights $\{w_j^i\}$ for $j = 1, \dots, i-1, i+1, \dots, n$ is defined.

1. d-nn1:

wl oc exp I ^ for j = 1,. . . ,i - + 1, 

where d is a distance measure such as Euclidean.

2. d-nn2:

$$w_j^i \propto \epsilon + (1 - \epsilon) I(d(x_i, x_j) < \sigma), \quad \text{for } j = 1, \dots, i-1, i+1, \dots, n,$$

again where I is an indicator function taking value 1 if $d(x_i, x_j) < \sigma$ and 0
otherwise. Further, $\epsilon \in (0, 1)$ is defined as a constant, and is set to a value close
to 0. (Throughout this paper we assign the value $\epsilon = 10^{-10}$.) A non-zero value
of $\epsilon$ guarantees that if there are no features within a distance $\sigma$ of $x_i$ then the
class of $y_i$ is modelled using the marginal proportions of the class labels.

3. d-nn3:

$$w_j^i \propto \exp\{-d(x_i, x_j)\sigma\}, \quad \text{for } j = 1, \dots, i-1, i+1, \dots, n.$$
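
As an illustration, the d-nn2 and d-nn3 weights and the resulting full-conditional distribution can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the proportionality sign is read as normalising the weights to sum to one over $j \neq i$, the distance is Euclidean, and the function names are hypothetical.

```python
import math


def dnn_weights(i, xs, sigma, model, eps=1e-10):
    """Normalised weights w_j^i over all j != i for the d-nn2 or d-nn3 model."""
    raw = {}
    for j in range(len(xs)):
        if j == i:
            continue
        d = math.dist(xs[i], xs[j])  # Euclidean distance (assumed choice of d)
        if model == "d-nn2":
            raw[j] = eps + (1.0 - eps) * (d < sigma)
        elif model == "d-nn3":
            raw[j] = math.exp(-d * sigma)
        else:
            raise ValueError("unknown model: " + model)
    total = sum(raw.values())  # assumed normalisation: weights sum to one
    return {j: w / total for j, w in raw.items()}


def full_conditional(i, xs, ys, beta, sigma, n_classes, model="d-nn3"):
    """P(y_i = g | rest) for g = 1..G, proportional to exp(beta * sum_j w_j^i I(y_j = g))."""
    w = dnn_weights(i, xs, sigma, model)
    scores = [beta * sum(wj for j, wj in w.items() if ys[j] == g)
              for g in range(1, n_classes + 1)]
    top = max(scores)  # subtract the maximum for numerical stability
    unnorm = [math.exp(s - top) for s in scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

With `beta = 0` the function returns the uniform distribution over the G classes, and under d-nn2 a point with no neighbour within distance `sigma` receives equal weights on all other points.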



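Clearly the neighbourhood system for each of these models is symmetric, and so the
Hammersley-Clifford theorem guarantees that the joint distribution of the class labels
is a Markov random field. This joint distribution is written as

$$\pi(y \mid x, \beta, \sigma) = \frac{q(y \mid \beta, \sigma, x)}{z(\beta, \sigma)} = \frac{\exp\Big( \beta \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_j^i I(y_i = y_j) \Big)}{z(\beta, \sigma)}.$$

As usual, the normalising constant of such a Markov random field is difficult to
evaluate in all but trivial cases. It appears as

$$z(\beta, \sigma) = \sum_{y_1} \cdots \sum_{y_n} \exp\Big( \beta \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_j^i I(y_i = y_j) \Big). \qquad (3)$$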
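
To see why the normalising constant is intractable, the sketch below evaluates it by brute force, summing over all $G^n$ labellings. It is meant only for toy problems; the weight matrix `w` (with `w[i][j]` standing for $w_j^i$) is supplied by the caller, and the function names are illustrative.

```python
import itertools
import math


def log_q(ys, w, beta):
    """Unnormalised log-probability: beta * sum_i sum_{j != i} w[i][j] * I(y_i = y_j)."""
    n = len(ys)
    return beta * sum(w[i][j] for i in range(n) for j in range(n)
                      if j != i and ys[i] == ys[j])


def brute_force_z(w, beta, n_classes):
    """Sum exp(log_q) over all G^n labellings; feasible only for very small n."""
    n = len(w)
    labels = range(1, n_classes + 1)
    return sum(math.exp(log_q(y, w, beta))
               for y in itertools.product(labels, repeat=n))
```

The number of terms grows as $G^n$, so even a modest training set with a handful of classes is far beyond direct enumeration.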

Some comments:

1. The k-nn algorithm and its probabilistic variants always contain neighbourhoods
of size k, regardless of how far each of the neighbouring points is from the
centre point, $x_i$. Moreover, each neighbouring point $x_j$ has equal influence,
regardless of distance from $x_i$. It could therefore be argued that these algorithms
are not robust to outliers. By contrast the distance nearest neighbour models
deal with outlying points in a more robust manner, since if a point $x_j$ lies further
away than the other neighbours of $x_i$, then it will have a relatively smaller weight,
and consequently less influence in determining the likely class label of $y_i$.

2. The formulation of distance nearest neighbour models includes every training
point in the neighbourhood set, but the value of $\sigma$ determines the relative
influence of points in the neighbourhood set. For the d-nn1 model, small values of
$\sigma$ imply that only those points with small distance from the centre point will be
influential, while for large values of $\sigma$, points in the neighbourhood set are more
uniformly weighted. Similarly, for the d-nn2 model, points within a radius $\sigma$ of
the centre point are weighted equally, while those outside a radius $\sigma$ of the centre
point will have relatively little weight when $\epsilon$ is very close to 0. By contrast, for
the d-nn3 model, large values of the parameter $\sigma$ imply that points close to the
centre point will be influential.

3. For the d-nn2 model, if there are no features in the training set within a distance
$\sigma$ of $x_i$, then

$$\pi(y_i = j \mid y_{-i}, x, \beta, \sigma) \propto \exp(\beta p_j), \quad \text{for } j = 1, \dots, G,$$

where $p_j$ denotes the proportion of class labels j in the set $y \setminus \{y_i\}$. The
parameter $\beta$ determines the dependence on the class proportions. A large value of
$\beta$ typically predicts the class label to be the class with the largest proportion,
whereas a small value of $\beta$ results in a prediction which is almost uniform over
all possible classes. Conversely, if there are any feature vectors within a radius $\sigma$ of
$x_i$, then the class labels for these features will most influence the class label
$y_i$.

4. As $\beta \to \infty$, the most frequently occurring training label in the neighbourhood
of a test point will be chosen with increasingly large probability. The $\beta$ parameter
can be thought of, in a sense, as a tempering parameter. In the limit as $\beta \to \infty$,
the modal class label in the neighbourhood set has probability 1.
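
As a small worked illustration of the tempering role of $\beta$ (the numbers are hypothetical and not taken from the paper), suppose there are G = 2 classes and that the weights of the neighbours of $x_i$ carrying labels 1 and 2 sum to $W_1 = 0.7$ and $W_2 = 0.3$. The full conditional then reduces to a logistic function of $\beta$:

```latex
\pi(y_i = 1 \mid y_{-i}, x, \beta, \sigma)
  = \frac{e^{\beta W_1}}{e^{\beta W_1} + e^{\beta W_2}}
  = \frac{1}{1 + e^{-\beta (W_1 - W_2)}}
  = \frac{1}{1 + e^{-0.4\,\beta}},
\qquad
\beta = 0:\ 0.5, \quad
\beta = 5:\ \frac{1}{1 + e^{-2}} \approx 0.881, \quad
\beta = 50:\ \frac{1}{1 + e^{-20}} \approx 1.
```

The probability of the modal class moves from 1/2 at $\beta = 0$ towards 1 as $\beta$ grows, in line with the fourth comment.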

There has been work on extending the k-nearest neighbours algorithm to weight
neighbours within the neighbourhood of size k. For example, Dudani (1976) weighted
neighbours using the distance in a linear manner while standardizing weights to lie in
[0,1].
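
For illustration, one common statement of this linear weighting gives the nearest of the k neighbours weight 1 and the furthest weight 0, interpolating linearly in between. The sketch below follows that statement; it should be read as an assumed form of the rule rather than a quotation from Dudani (1976), and the function name is hypothetical.

```python
def dudani_weights(sorted_dists):
    """Linear weights in [0, 1] for k neighbour distances sorted in increasing order."""
    d_first, d_last = sorted_dists[0], sorted_dists[-1]
    if d_last == d_first:
        # All neighbours are equally distant, so they are weighted equally.
        return [1.0] * len(sorted_dists)
    return [(d_last - d) / (d_last - d_first) for d in sorted_dists]
```

Unlike the distance nearest neighbour weights, these weights still depend on a fixed neighbourhood size k.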



A model similar to the d-nn1 model appeared in Zhu and Ghahramani (2002), but
it does not contain the $\beta$ Markov random field parameter to control the level of
aggregation in the spatial field. Moreover, the authors outline some MCMC approaches, but
note that inference for this model is challenging. The aim of this paper is to illustrate
how this model may be generalised and to illustrate an efficient algorithm to sample
from this model. We now address the latter issue.



4 Implementing the distance-nearest neighbours algorithm

Throughout we consider a Bayesian treatment of this problem. The posterior
distribution of test labels and Markov random field parameters can be expressed as

$$\pi(y^*, \beta, \sigma \mid y, x, x^*) \propto \pi(y, y^* \mid \beta, \sigma, x, x^*)\, \pi(\beta)\, \pi(\sigma),$$

where $\pi(\beta)$ and $\pi(\sigma)$ are prior distributions for $\beta$ and $\sigma$, respectively. Note, however,
that the first term on the right hand side above depends on the intractable normalising
constant (3). In fact, the number of test labels is often much greater than the number
of training labels, and so the resulting normalising constant for the distribution
$\pi(y, y^* \mid \beta, \sigma, x, x^*)$ involves a summation over $G^{n+m}$ terms, where as before n, m and G
are the number of training data points, test data points and class labels, respectively.
A more pragmatic alternative is to consider the posterior distribution of the unknown
parameters for the training class labels,

$$\pi(\beta, \sigma \mid x, y) \propto \pi(y \mid \beta, \sigma, x)\, \pi(\beta)\, \pi(\sigma),$$

where now the normalising constant depends on $G^n$ terms. Test class labels can then
be predicted by averaging over the posterior distribution of the training data.

Obviously, this assumes that the test class labels, y* are mutually independent, given 
the training data, which will typically be an unreasonable assumption. The training 
class labels are modelled as being mutually independent. Clearly, this is not ideal from 
the Bayesian perspective. Nevertheless, it should reduce the computational complexity 
of the problem dramatically. 
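In practice, we can estimate the predictive probability of $y_{n+i}$ as an ergodic average

$$\pi(y_{n+i} \mid x_{n+i}, x, y) \approx \frac{1}{J} \sum_{j=1}^{J} \pi(y_{n+i} \mid x_{n+i}, x, y, \beta^{(j)}, \sigma^{(j)}),$$

where $\beta^{(j)}, \sigma^{(j)}$, $j = 1, \dots, J$, are samples from the posterior distribution $\pi(\beta, \sigma \mid x, y)$.
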
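
A minimal Python sketch of this averaging step is given below. It assumes the d-nn3 weight function, Euclidean distance, weights normalised to sum to one, and that the neighbourhood of a test point is the whole training set; the function name and the list `draws` of posterior samples are hypothetical.

```python
import math


def predictive_probs(x_new, xs, ys, draws, n_classes):
    """Average the d-nn3 class probabilities of a test point over posterior draws.

    draws is a list of (beta, sigma) pairs, e.g. MCMC output after burn-in.
    """
    dists = [math.dist(x_new, xj) for xj in xs]
    avg = [0.0] * n_classes
    for beta, sigma in draws:
        raw = [math.exp(-d * sigma) for d in dists]  # unnormalised d-nn3 weights
        total = sum(raw)
        scores = [beta * sum(r for r, y in zip(raw, ys) if y == g) / total
                  for g in range(1, n_classes + 1)]
        top = max(scores)  # subtract the maximum for numerical stability
        unnorm = [math.exp(s - top) for s in scores]
        z = sum(unnorm)
        for g in range(n_classes):
            avg[g] += unnorm[g] / (z * len(draws))
    return avg
```

The returned list gives one probability per class, so the uncertainty in each test allocation is available alongside the most likely label.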
4.1 Pseudolikelihood estimation 

A standard approach to approximate the distribution of a Markov random field is to use
a pseudolikelihood approximation, first proposed in Besag (1974). This approximation
consists of a product of easily normalised full-conditional distributions. For our model,
we can write a pseudolikelihood approximation as

$$\pi(y \mid x, \beta, \sigma) \approx \prod_{i=1}^{n} \pi(y_i \mid y_{-i}, x, \beta, \sigma) = \prod_{i=1}^{n} \frac{\exp\big( \beta \sum_{j=1, j \neq i}^{n} w_j^i I(y_j = y_i) \big)}{\sum_{g=1}^{G} \exp\big( \beta \sum_{j=1, j \neq i}^{n} w_j^i I(y_j = g) \big)}.$$

This approximation yields a fast approximation to the posterior distribution; however,
it does ignore dependencies beyond first order.
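
The pseudolikelihood is straightforward to evaluate, as the following sketch of its logarithm shows. The d-nn3 weights, Euclidean distance and normalisation of the weights to sum to one are assumptions of this example, and the function name is hypothetical.

```python
import math


def log_pseudolikelihood(xs, ys, beta, sigma, n_classes):
    """Sum over i of log pi(y_i | y_{-i}, x, beta, sigma) under d-nn3 weights."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        raw = {j: math.exp(-math.dist(xs[i], xs[j]) * sigma)
               for j in range(n) if j != i}
        norm = sum(raw.values())
        scores = [beta * sum(r for j, r in raw.items() if ys[j] == g) / norm
                  for g in range(1, n_classes + 1)]
        top = max(scores)
        # Log of the per-site normalising sum over the G classes.
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[ys[i] - 1] - log_norm
    return total
```

Each site requires a sum over only G classes, so the cost is polynomial in n rather than exponential.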

4.2 The exchange algorithm 

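The main computational burden is sampling from the posterior distribution

$$\pi(\beta, \sigma \mid x, y) \propto \pi(y \mid \beta, \sigma, x)\, \pi(\beta)\, \pi(\sigma) = \frac{q(y \mid \beta, \sigma, x)}{z(\beta, \sigma)}\, \pi(\beta)\, \pi(\sigma).$$

A naive implementation of a Metropolis-Hastings algorithm proposing to move from
$(\beta, \sigma)$ to $(\beta', \sigma')$ would require calculation of the following ratio at each sweep of the
algorithm

$$\frac{q(y \mid \beta', \sigma', x)\, \pi(\beta')\, \pi(\sigma')}{q(y \mid \beta, \sigma, x)\, \pi(\beta)\, \pi(\sigma)} \times \frac{z(\beta, \sigma)}{z(\beta', \sigma')}. \qquad (4)$$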

The intractability of the normalising constants, $z(\beta, \sigma)$ and $z(\beta', \sigma')$, makes this
algorithm unworkable. There has been work which has tackled the problem of sampling
from such complicated distributions, for example, Møller et al. (2006). The algorithm
presented in that paper overcomes the problem of sampling from a distribution with
intractable normalising constant, to a large extent. However, the algorithm can result
in an MCMC chain with poor mixing among the parameters. The algorithm in Møller
et al. (2006) has been extended and improved in Murray, Ghahramani and MacKay
(2006).


The algorithm samples from an augmented distribution

$$\pi(\beta', \sigma', y', \beta, \sigma \mid y, x) \propto \pi(y \mid \beta, \sigma, x)\, \pi(\beta)\, \pi(\sigma)\, h(\beta', \sigma' \mid \beta, \sigma)\, \pi(y' \mid \beta', \sigma', x), \qquad (5)$$

where $\pi(y' \mid \beta', \sigma', x)$ is the same distance nearest-neighbour distribution as the training
data y. The distribution $h(\beta', \sigma' \mid \beta, \sigma)$ is any arbitrary distribution for the augmented
variables $(\beta', \sigma')$ which might depend on the variables $(\beta, \sigma)$, for example, a random
walk distribution centred at $(\beta, \sigma)$. It is clear that the marginal distribution of (5) for
variables $\sigma$ and $\beta$ is the posterior distribution of interest.

The algorithm can be written in the following concise way:

1. Gibbs update of $(\beta', \sigma', y')$:

(i) Draw $(\beta', \sigma') \sim h(\cdot, \cdot \mid \beta, \sigma)$.

(ii) Draw $y' \sim \pi(\cdot \mid \beta', \sigma', x)$.

2. Propose to move from $(\beta, \sigma, y), (\beta', \sigma', y')$ to $(\beta', \sigma', y), (\beta, \sigma, y')$
(exchange move) with probability

$$\min\left( 1, \frac{q(y' \mid \beta, \sigma, x)\, \pi(\beta')\, \pi(\sigma')\, h(\beta, \sigma \mid \beta', \sigma')\, q(y \mid \beta', \sigma', x)}{q(y \mid \beta, \sigma, x)\, \pi(\beta)\, \pi(\sigma)\, h(\beta', \sigma' \mid \beta, \sigma)\, q(y' \mid \beta', \sigma', x)} \times \frac{z(\beta, \sigma)\, z(\beta', \sigma')}{z(\beta, \sigma)\, z(\beta', \sigma')} \right).$$



Notice in Step 2 that all intractable normalising constants cancel above and below
the fraction. The difficult step of the algorithm in the context of the d-nn model is
Step 1 (ii), since this requires a draw from $\pi(y' \mid \beta', \sigma', x)$. Perfect sampling (Propp and
Wilson 1996) is often possible for Markov random field models; however, a pragmatic
alternative is to sample from $\pi(\cdot \mid \beta', \sigma', x)$ by standard MCMC methods, for example,
Gibbs sampling, and take a realisation from a long run of the chain as an approximate
draw from the distribution. Note that this is the approach that Cucala et al. (2009)
take. They argue that perfect sampling is possible for the pk-nn algorithm for the
case where there are two classes, but that the time to coalescence can be prohibitively
large. They note that perfect sampling for more than two classes is not yet available.
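
The two steps of the exchange algorithm, with the auxiliary draw in Step 1 (ii) approximated by Gibbs sweeps, can be sketched in Python as follows. This is an illustrative sketch, not the authors' Matlab code. It assumes the d-nn3 weights normalised to sum to one, Euclidean distance, a symmetric Gaussian random-walk proposal for h, and the $N(0, 50^2)$ and $U(0, 100)$ priors quoted in the results section; the function names, starting values, step size and number of sweeps are hypothetical.

```python
import math
import random


def weight_matrix(xs, sigma):
    """Row-normalised d-nn3 weights, w[i][j] proportional to exp(-d(x_i, x_j) * sigma)."""
    n = len(xs)
    w = []
    for i in range(n):
        d = [math.dist(xs[i], xs[j]) for j in range(n)]
        d_min = min(d[j] for j in range(n) if j != i)  # shift to avoid underflow
        row = [0.0 if j == i else math.exp(-(d[j] - d_min) * sigma) for j in range(n)]
        total = sum(row)
        w.append([v / total for v in row])
    return w


def log_q(ys, w, beta):
    """Unnormalised log-probability beta * sum_i sum_{j != i} w[i][j] * I(y_i = y_j)."""
    n = len(ys)
    return beta * sum(w[i][j] for i in range(n) for j in range(n)
                      if i != j and ys[i] == ys[j])


def gibbs_draw(ys, w, beta, n_classes, sweeps, rng):
    """Approximate draw from pi(. | beta, sigma, x) by single-site Gibbs sweeps."""
    ys, n = list(ys), len(ys)
    for _ in range(sweeps):
        for i in range(n):
            # In the joint q, label i interacts with label j through w[i][j] + w[j][i].
            scores = [beta * sum(w[i][j] + w[j][i] for j in range(n)
                                 if j != i and ys[j] == g)
                      for g in range(1, n_classes + 1)]
            top = max(scores)
            probs = [math.exp(s - top) for s in scores]
            ys[i] = rng.choices(range(1, n_classes + 1), weights=probs)[0]
    return ys


def log_prior(beta, sigma):
    """N(0, 50^2) prior on beta and U(0, 100) prior on sigma, up to a constant."""
    if not 0.0 < sigma < 100.0:
        return -math.inf
    return -0.5 * (beta / 50.0) ** 2


def exchange_sampler(xs, ys, n_classes, iters, sweeps=20, step=0.5, seed=0):
    """Exchange algorithm with a symmetric Gaussian random-walk proposal h."""
    rng = random.Random(seed)
    beta, sigma = 1.0, 1.0  # arbitrary starting values
    w = weight_matrix(xs, sigma)
    chain = []
    for _ in range(iters):
        beta_p, sigma_p = beta + rng.gauss(0, step), sigma + rng.gauss(0, step)
        if log_prior(beta_p, sigma_p) > -math.inf:
            w_p = weight_matrix(xs, sigma_p)
            y_aux = gibbs_draw(ys, w_p, beta_p, n_classes, sweeps, rng)
            # The normalising constants cancel, leaving only q and prior terms.
            log_ratio = (log_q(ys, w_p, beta_p) + log_prior(beta_p, sigma_p)
                         + log_q(y_aux, w, beta)
                         - log_q(ys, w, beta) - log_prior(beta, sigma)
                         - log_q(y_aux, w_p, beta_p))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                beta, sigma, w = beta_p, sigma_p, w_p
        chain.append((beta, sigma))
    return chain
```

Because the auxiliary labels come from a finite Gibbs run rather than an exact draw, the sampler is approximate in the same sense as described above; a longer auxiliary run makes the approximation closer at greater cost.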
Note that this algorithm has some similarities with Approximate Bayesian Computation
(ABC) methods (Sisson, Fan and Tanaka 2007) in the sense that ABC algorithms
also rely on drawing exact values from analytically intractable distributions.
By contrast however, ABC algorithms rely on comparing summary statistics of the
auxiliary data to summary statistics of the observed data. Finally, note that the
Metropolis-Hastings ratio in Step 2 above, after re-arranging some terms, and assuming
that $h(\beta, \sigma \mid \beta', \sigma')$ is symmetric, can be written as

$$\frac{q(y \mid \beta', \sigma', x)\, \pi(\beta')\, \pi(\sigma')\, q(y' \mid \beta, \sigma, x)}{q(y \mid \beta, \sigma, x)\, \pi(\beta)\, \pi(\sigma)\, q(y' \mid \beta', \sigma', x)}.$$

Comparing this to (4), we see that the ratio of normalising constants, $z(\beta, \sigma)/z(\beta', \sigma')$,
is replaced by $q(y' \mid \beta, \sigma, x)/q(y' \mid \beta', \sigma', x)$, which itself can be interpreted as an
importance sampling estimate of $z(\beta, \sigma)/z(\beta', \sigma')$, since

$$E_{y' \mid \beta', \sigma'}\left[ \frac{q(y' \mid \beta, \sigma, x)}{q(y' \mid \beta', \sigma', x)} \right] = \int \frac{q(y' \mid \beta, \sigma, x)}{q(y' \mid \beta', \sigma', x)}\, \frac{q(y' \mid \beta', \sigma', x)}{z(\beta', \sigma')}\, dy' = \frac{z(\beta, \sigma)}{z(\beta', \sigma')}.$$


5 Results 



The performance of our algorithm is illustrated in a variety of settings. We begin
by testing the algorithm on a collection of benchmark datasets and follow this by
exploring two real datasets with high-dimensional feature vectors. Matlab computer
code and all of the datasets (test and training) used in this paper can be found at
mathsci.ucd.ie/~nial/dnn/.

5.1 Benchmark datasets 

In this section we present results for our model and in each case we compare results
with the k-nn algorithm for well known benchmark datasets. A summary description
of each dataset is presented in Table 1.

                   G     F     N
Pima               2     8   532
Forensic glass     4     9   214
Iris               3     4   150
Crabs              4     5   200
Wine               3    13   178
Olive              3     9   572

Table 1: Summary of the benchmark datasets: G, F, N correspond to the number of 
classes, the number of features and the overall number of observations, respectively. 

In all situations, the training dataset was approximately 25% of the size of the
overall dataset, thereby presenting a challenging scenario for the various algorithms.
Note that the size of the datasets ranges from quite small in the case of the iris
dataset, to reasonably large in the case of the forensic dataset. In all examples, the
data were standardised to give transformed features with zero mean and unit variance.
In the Bayesian model, non-informative $N(0, 50^2)$ and $U(0, 100)$ priors were chosen
for $\beta$ and $\sigma$, respectively. Each d-nn algorithm was run for 20,000 iterations, with
the first 10,000 serving as burn-in iterations. The auxiliary chain within the exchange
algorithm was run for 1,000 iterations. The k-nn algorithm was computed for values
of k from 1 to half the number of features in the training set. In terms of computational
run time, the d-nn algorithms took, depending on the size of the dataset, between 1
and 12 hours to run using Matlab code on a 2GHz desktop machine.
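
The standardisation step can be sketched as below. Whether the population or the sample standard deviation was used, and whether the scaling was computed on the training data alone, is not stated in the text; the population version applied to whatever rows are passed in is an assumption of this example, as is the function name.

```python
import statistics


def standardise(rows):
    """Centre each feature column to zero mean and scale it to unit variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Population standard deviation; a constant column would give a division by zero.
    sds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]
```

Standardising puts the features on a common scale, which matters here because the weights depend directly on distances in the feature space.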

A summary of misclassification error rates is presented in Table 2 for various
benchmark datasets. In almost all of the situations d-nn1 and d-nn3 perform at least as
well as k-nn and often considerably better. In general, d-nn1 and d-nn3 performed
better than d-nn2. A possible explanation for this may be the cut-off nature
of the weight function in the d-nn2 model, since if a point $x_i$ has no neighbours inside
a ball of radius $\sigma$, then $w_j^i$ is uniform over the entire test set, and consequently there
is no effect of distance. By contrast, both the d-nn1 and d-nn3 models have weight
functions which depend on distance, and smoothly converge to a uniform distribution
as $\sigma \to \infty$ and $\sigma \to 0$, respectively.

                  k-nn   d-nn1   d-nn2   d-nn3
Pima               30%     29%     32%     30%
Forensic glass     35%     33%     39%     31%
Iris                6%      5%      5%      6%
Crabs              16%     16%     23%     16%
Wine                6%      4%      6%      4%
Olive               1%      3%      4%      2%

Table 2: Misclassification error rates for various benchmark datasets. The value of
k in the k-nn algorithm was chosen as the value that minimises the leave-one-out
cross-validation error rate. (In the case of a tie, the smallest value of k was selected.)



5.2 Classification with large feature sets: food authenticity 

Here we consider two datasets concerned with food authentication. The first example
involves samples of Greek olive oil from 3 different regions, and the second example
involves samples of 5 different types of meat. In both situations each sample was
analysed using near infra-red spectroscopy giving rise to 1050 reflectance values for
wavelengths in the range 400-2098nm. These 1050 reflectance values serve as the
feature vector for each sample. The objective in both examples is to authenticate a
test sample based on a training set of complete data (both reflectance values and class
labels). Details of how both datasets were collected appear in McElhinney et al. (1999),
and the datasets were analysed using a model-based clustering approach in Dean et al. (2006).



5.2.1 Classifying meat samples 

Here 231 samples of meat were collected. The aim of this study was to see if these 
measurements could be used to classify each meat sample according to whether it is 
chicken, turkey, pork, beef or lamb. The data were randomly split into 60 training 
samples and 171 test samples. The respective number of samples in each class is given 
in the table below. 



           Training   Test
Chicken          15     40
Turkey           20     35
Pork             13     42
Beef             11     21
Lamb             11     23

Table 3: Number of samples within each class for both the training and test datasets 

As before, non-informative normal, $N(0, 50^2)$, and uniform, $U(0, 10)$, priors were
chosen for $\beta$ and $\sigma$, respectively. In the exchange algorithm, the auxiliary chain was
run for 1,000 iterations, and the overall chain ran for 20,000 iterations, of which the
first 10,000 were discarded as burn-in. The overall acceptance rate for the exchange
algorithm was around 25% for each of the d-nn models.

The misclassification error rate for leave-one-out cross-validation on the training
dataset is minimised for k = 3 and k = 4. See Figure 1(a). At these values,
the k-nn algorithm yielded misclassification error rates of 35% and 39%, respectively,
for the test dataset. See Figure 1(b). By comparison, the d-nn1, d-nn2 and d-nn3
models achieved misclassification error rates of 29%, 33% and 27%, respectively, for
the test dataset. This example further illustrates the value of the d-nn models.




[Figure 1, panels (a) and (b); horizontal axis: k = 1, ..., 10]

Figure 1: Meat dataset: (a) Training data: misclassification rates of leave-one-out
cross-validation for the k-nn algorithm for varying values of k. (b) Test data:
misclassification rates for the k-nn algorithm for varying values of k.



5.2.2 Classifying Greek olive oil 

This example concerns classifying Greek oil samples, again based on infra-red
spectroscopy. Here 65 samples of Greek virgin olive-oil were collected. The aim of this
study was to see if these measurements could be used to classify each olive-oil sample
to the correct geographical region. Here there were 3 possible classes: Crete (18
locations), Peloponnese (28 locations) and other regions (19 locations).

In our experiment the data were randomly split into a training set of 25 observations 
and a test set of 40 observations. In the training dataset the proportion of class labels 
was similar to that in the complete dataset. 

In the Bayesian model, non-informative $N(0, 50^2)$ and $U(0, 100)$ priors were chosen
for $\beta$ and $\sigma$. In the exchange algorithm, the auxiliary chain was run for 1,000 iterations,
and the overall chain ran for 50,000 iterations, of which the first 20,000 were discarded
as burn-in. The overall acceptance rate for the exchange algorithm was around 15%
for each of the Markov chains.

The d-nn1, d-nn2 and d-nn3 models achieved misclassification rates of 20%, 26%
and 20%, respectively. In terms of comparison with the k-nn algorithm, the leave-one-out
cross-validation error was minimised for k = 3 for the training dataset. See Figure 2(a).
The misclassification rate at this value of k was 29% for the test dataset. See Figure 2(b).

It is again encouraging that the d-nn algorithms yielded improved misclassification
rates by comparison.

Figure 2: Olive oil dataset: (a) Training data: misclassification rates of leave-one-out
cross-validation for the k-nn algorithm for varying values of k. (b) Test data:
misclassification rates for the k-nn algorithm for varying values of k.



6 Concluding remarks 

In terms of providing a probabilistic approach to a Bayesian analysis of supervised
learning, our work builds on that of Cucala et al. (2009) and shares many of the
advantages of the approach there, providing a sound setting for Bayesian inference. The
most likely allocations for the test dataset can be evaluated, and also the uncertainty
that goes with them. This makes it possible to determine regions where allocation
to specific classes is uncertain. In addition, the Bayesian framework allows for an
automatic approach to choosing weights for neighbours or neighbourhood sizes.

The present paper also addresses the computational difficulties related to the
well-known issue of the intractable normalising constant for discrete exponential family
models. While Cucala et al. (2009) demonstrated that MCMC sampling is a practical
alternative to the perfect sampling scheme of Møller et al. (2006), there remain
difficulties with their implementation of the approach of Møller et al. (2006), namely
the choice of an auxiliary distribution. To partially overcome the difficulties of a poor
choice, Cucala et al. (2009) use an adaptive algorithm where the auxiliary distribution
is defined by using historical values in the Monte Carlo algorithm. We use an
alternative approach based on the exchange algorithm, which avoids this choice or
adaptation, has very good mixing properties and is therefore computationally efficient.

An issue with the neighbourhood model of Cucala et al. (2009), which is an Ising
or Boltzmann type model, is that it is necessary to define an upper value for the
association parameter $\beta$. This parameter value arises from the phase change of the model,
which is known for a regular neighbourhood structure but has to be investigated
empirically for the probabilistic neighbourhood model. Our distance nearest neighbour
models avoid this difficulty.

Our approach is robust to outliers, whereas under the nearest neighbour approaches an
outlying point always has neighbours and is therefore classified according to distant
points, assumed independent, which happen to be its nearest neighbours.